In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Reading texts

In this notebook we will read a single text file, and experiment with tokenization, normalization, and filtering.


In [2]:
import nltk
import matplotlib.pyplot as plt

Read a file from disk


In [3]:
# Try changing this path to point to a different text file.
raw_text = nltk.load('../data/lorem.txt')

In [4]:
raw_text


Out[4]:
u'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque non lacinia velit, quis scelerisque orci. Fusce ac nisi sit amet justo porta aliquam. Fusce ac arcu ut diam imperdiet hendrerit. Integer vitae finibus lacus. Mauris pellentesque, ipsum eget scelerisque molestie, diam est molestie arcu, a blandit nulla eros ut dui. Pellentesque odio massa, rutrum nec magna ac, viverra pulvinar magna. Duis ultricies, ligula id vulputate eleifend, metus leo semper ante, sed suscipit lacus odio non justo.\n\nInterdum et malesuada fames ac ante ipsum primis in faucibus. Donec a libero et nisl tincidunt dignissim in et arcu. Nam eleifend mi ut sodales suscipit. Aenean sit amet nunc tincidunt, facilisis purus et, porta dui. Maecenas id tortor fusce malesuada sem accumsan feugiat. Pellentesque molestie velit nec arcu consectetur, sed tempor tellus varius. In hac habitasse platea dictumst. Aliquam erat volutpat. Sed placerat varius purus eget semper.\n\nFusce quis risus eu purus faucibus dignissim. Morbi facilisis dolor sed purus tincidunt, at eleifend lorem iaculis. Maecenas sed lobortis nunc. Nullam sollicitudin finibus imperdiet. Nam pharetra ex justo, non elementum massa molestie a. Vivamus quis ipsum nulla. Quisque leo nisl, luctus in ultrices id, malesuada id purus. Ut laoreet justo eget elementum tempus.\n\nQuisque vitae aliquet quam, non accumsan felis. Sed nec lectus ante. Phasellus mi diam, facilisis eu vulputate at, vulputate blandit fusce urna. Fusce iaculis pharetra accumsan. In hac habitasse platea dictumst. Mauris vel massa felis. Fusce congue et tellus vitae condimentum. Cras lacinia dignissim mollis. Donec pellentesque eleifend purus, non varius tortor rhoncus non. Nulla tempor, nisl a fermentum tristique, orci dui ultricies velit, at pretium nibh justo a sem. In faucibus enim lacinia sem pretium pretium eu ut lectus. Nullam consequat orci ac odio ultrices, eu hendrerit nisl eleifend. Nulla elit nulla, finibus at sem nec, faucibus ultricies leo. Nunc ante elit, laoreet non magna at, laoreet ultricies ipsum.\n\nInteger vel nulla hendrerit, lacinia sem a, fusce pellentesque lorem. Aliquam ullamcorper dapibus fermentum. Integer a mi sed ante faucibus efficitur non in metus. Nunc condimentum felis at consequat vulputate. Donec fusce bibendum rutrum diam. Nunc et nunc et magna sodales ultrices at feugiat ipsum. Sed tincidunt urna vel lectus dignissim dictum.'

Tokenization

Tokenization is the process of splitting up a text into "tokens". For our purposes, the tokens that we are interested in are usually words.

The simplest approach might be to split the text wherever we find a space. This can work just fine, but it leaves several undesirable artefacts: newlines and punctuation remain attached to the adjacent tokens.


In [5]:
raw_text.split(' ')


Out[5]:
[u'Lorem',
 u'ipsum',
 u'dolor',
 u'sit',
 u'amet,',
 u'consectetur',
 u'adipiscing',
 u'elit.',
 u'Pellentesque',
 u'non',
 u'lacinia',
 u'velit,',
 u'quis',
 u'scelerisque',
 u'orci.',
 u'Fusce',
 u'ac',
 u'nisi',
 u'sit',
 u'amet',
 u'justo',
 u'porta',
 u'aliquam.',
 u'Fusce',
 u'ac',
 u'arcu',
 u'ut',
 u'diam',
 u'imperdiet',
 u'hendrerit.',
 u'Integer',
 u'vitae',
 u'finibus',
 u'lacus.',
 u'Mauris',
 u'pellentesque,',
 u'ipsum',
 u'eget',
 u'scelerisque',
 u'molestie,',
 u'diam',
 u'est',
 u'molestie',
 u'arcu,',
 u'a',
 u'blandit',
 u'nulla',
 u'eros',
 u'ut',
 u'dui.',
 u'Pellentesque',
 u'odio',
 u'massa,',
 u'rutrum',
 u'nec',
 u'magna',
 u'ac,',
 u'viverra',
 u'pulvinar',
 u'magna.',
 u'Duis',
 u'ultricies,',
 u'ligula',
 u'id',
 u'vulputate',
 u'eleifend,',
 u'metus',
 u'leo',
 u'semper',
 u'ante,',
 u'sed',
 u'suscipit',
 u'lacus',
 u'odio',
 u'non',
 u'justo.\n\nInterdum',
 u'et',
 u'malesuada',
 u'fames',
 u'ac',
 u'ante',
 u'ipsum',
 u'primis',
 u'in',
 u'faucibus.',
 u'Donec',
 u'a',
 u'libero',
 u'et',
 u'nisl',
 u'tincidunt',
 u'dignissim',
 u'in',
 u'et',
 u'arcu.',
 u'Nam',
 u'eleifend',
 u'mi',
 u'ut',
 u'sodales',
 u'suscipit.',
 u'Aenean',
 u'sit',
 u'amet',
 u'nunc',
 u'tincidunt,',
 u'facilisis',
 u'purus',
 u'et,',
 u'porta',
 u'dui.',
 u'Maecenas',
 u'id',
 u'tortor',
 u'fusce',
 u'malesuada',
 u'sem',
 u'accumsan',
 u'feugiat.',
 u'Pellentesque',
 u'molestie',
 u'velit',
 u'nec',
 u'arcu',
 u'consectetur,',
 u'sed',
 u'tempor',
 u'tellus',
 u'varius.',
 u'In',
 u'hac',
 u'habitasse',
 u'platea',
 u'dictumst.',
 u'Aliquam',
 u'erat',
 u'volutpat.',
 u'Sed',
 u'placerat',
 u'varius',
 u'purus',
 u'eget',
 u'semper.\n\nFusce',
 u'quis',
 u'risus',
 u'eu',
 u'purus',
 u'faucibus',
 u'dignissim.',
 u'Morbi',
 u'facilisis',
 u'dolor',
 u'sed',
 u'purus',
 u'tincidunt,',
 u'at',
 u'eleifend',
 u'lorem',
 u'iaculis.',
 u'Maecenas',
 u'sed',
 u'lobortis',
 u'nunc.',
 u'Nullam',
 u'sollicitudin',
 u'finibus',
 u'imperdiet.',
 u'Nam',
 u'pharetra',
 u'ex',
 u'justo,',
 u'non',
 u'elementum',
 u'massa',
 u'molestie',
 u'a.',
 u'Vivamus',
 u'quis',
 u'ipsum',
 u'nulla.',
 u'Quisque',
 u'leo',
 u'nisl,',
 u'luctus',
 u'in',
 u'ultrices',
 u'id,',
 u'malesuada',
 u'id',
 u'purus.',
 u'Ut',
 u'laoreet',
 u'justo',
 u'eget',
 u'elementum',
 u'tempus.\n\nQuisque',
 u'vitae',
 u'aliquet',
 u'quam,',
 u'non',
 u'accumsan',
 u'felis.',
 u'Sed',
 u'nec',
 u'lectus',
 u'ante.',
 u'Phasellus',
 u'mi',
 u'diam,',
 u'facilisis',
 u'eu',
 u'vulputate',
 u'at,',
 u'vulputate',
 u'blandit',
 u'fusce',
 u'urna.',
 u'Fusce',
 u'iaculis',
 u'pharetra',
 u'accumsan.',
 u'In',
 u'hac',
 u'habitasse',
 u'platea',
 u'dictumst.',
 u'Mauris',
 u'vel',
 u'massa',
 u'felis.',
 u'Fusce',
 u'congue',
 u'et',
 u'tellus',
 u'vitae',
 u'condimentum.',
 u'Cras',
 u'lacinia',
 u'dignissim',
 u'mollis.',
 u'Donec',
 u'pellentesque',
 u'eleifend',
 u'purus,',
 u'non',
 u'varius',
 u'tortor',
 u'rhoncus',
 u'non.',
 u'Nulla',
 u'tempor,',
 u'nisl',
 u'a',
 u'fermentum',
 u'tristique,',
 u'orci',
 u'dui',
 u'ultricies',
 u'velit,',
 u'at',
 u'pretium',
 u'nibh',
 u'justo',
 u'a',
 u'sem.',
 u'In',
 u'faucibus',
 u'enim',
 u'lacinia',
 u'sem',
 u'pretium',
 u'pretium',
 u'eu',
 u'ut',
 u'lectus.',
 u'Nullam',
 u'consequat',
 u'orci',
 u'ac',
 u'odio',
 u'ultrices,',
 u'eu',
 u'hendrerit',
 u'nisl',
 u'eleifend.',
 u'Nulla',
 u'elit',
 u'nulla,',
 u'finibus',
 u'at',
 u'sem',
 u'nec,',
 u'faucibus',
 u'ultricies',
 u'leo.',
 u'Nunc',
 u'ante',
 u'elit,',
 u'laoreet',
 u'non',
 u'magna',
 u'at,',
 u'laoreet',
 u'ultricies',
 u'ipsum.\n\nInteger',
 u'vel',
 u'nulla',
 u'hendrerit,',
 u'lacinia',
 u'sem',
 u'a,',
 u'fusce',
 u'pellentesque',
 u'lorem.',
 u'Aliquam',
 u'ullamcorper',
 u'dapibus',
 u'fermentum.',
 u'Integer',
 u'a',
 u'mi',
 u'sed',
 u'ante',
 u'faucibus',
 u'efficitur',
 u'non',
 u'in',
 u'metus.',
 u'Nunc',
 u'condimentum',
 u'felis',
 u'at',
 u'consequat',
 u'vulputate.',
 u'Donec',
 u'fusce',
 u'bibendum',
 u'rutrum',
 u'diam.',
 u'Nunc',
 u'et',
 u'nunc',
 u'et',
 u'magna',
 u'sodales',
 u'ultrices',
 u'at',
 u'feugiat',
 u'ipsum.',
 u'Sed',
 u'tincidunt',
 u'urna',
 u'vel',
 u'lectus',
 u'dignissim',
 u'dictum.']

A better way is to split on any whitespace, and to separate punctuation from adjacent tokens. The nltk package has a tokenizer called the Penn Treebank Tokenizer that does the job.
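
Incidentally, calling split() with no argument splits on any run of whitespace, which takes care of the newlines -- but punctuation still clings to the adjacent tokens, so a proper tokenizer is still needed:

raw_text.split()[:8]   # [u'Lorem', u'ipsum', u'dolor', u'sit', u'amet,', u'consectetur', u'adipiscing', u'elit.']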


In [6]:
# Shortcut for Penn Treebank Tokenizer.
from nltk import word_tokenize

In [7]:
tokens = word_tokenize(raw_text)

In [8]:
tokens


Out[8]:
[u'Lorem',
 u'ipsum',
 u'dolor',
 u'sit',
 u'amet',
 u',',
 u'consectetur',
 u'adipiscing',
 u'elit',
 u'.',
 u'Pellentesque',
 u'non',
 u'lacinia',
 u'velit',
 u',',
 u'quis',
 u'scelerisque',
 u'orci',
 u'.',
 u'Fusce',
 u'ac',
 u'nisi',
 u'sit',
 u'amet',
 u'justo',
 u'porta',
 u'aliquam',
 u'.',
 u'Fusce',
 u'ac',
 u'arcu',
 u'ut',
 u'diam',
 u'imperdiet',
 u'hendrerit',
 u'.',
 u'Integer',
 u'vitae',
 u'finibus',
 u'lacus',
 u'.',
 u'Mauris',
 u'pellentesque',
 u',',
 u'ipsum',
 u'eget',
 u'scelerisque',
 u'molestie',
 u',',
 u'diam',
 u'est',
 u'molestie',
 u'arcu',
 u',',
 u'a',
 u'blandit',
 u'nulla',
 u'eros',
 u'ut',
 u'dui',
 u'.',
 u'Pellentesque',
 u'odio',
 u'massa',
 u',',
 u'rutrum',
 u'nec',
 u'magna',
 u'ac',
 u',',
 u'viverra',
 u'pulvinar',
 u'magna',
 u'.',
 u'Duis',
 u'ultricies',
 u',',
 u'ligula',
 u'id',
 u'vulputate',
 u'eleifend',
 u',',
 u'metus',
 u'leo',
 u'semper',
 u'ante',
 u',',
 u'sed',
 u'suscipit',
 u'lacus',
 u'odio',
 u'non',
 u'justo',
 u'.',
 u'Interdum',
 u'et',
 u'malesuada',
 u'fames',
 u'ac',
 u'ante',
 u'ipsum',
 u'primis',
 u'in',
 u'faucibus',
 u'.',
 u'Donec',
 u'a',
 u'libero',
 u'et',
 u'nisl',
 u'tincidunt',
 u'dignissim',
 u'in',
 u'et',
 u'arcu',
 u'.',
 u'Nam',
 u'eleifend',
 u'mi',
 u'ut',
 u'sodales',
 u'suscipit',
 u'.',
 u'Aenean',
 u'sit',
 u'amet',
 u'nunc',
 u'tincidunt',
 u',',
 u'facilisis',
 u'purus',
 u'et',
 u',',
 u'porta',
 u'dui',
 u'.',
 u'Maecenas',
 u'id',
 u'tortor',
 u'fusce',
 u'malesuada',
 u'sem',
 u'accumsan',
 u'feugiat',
 u'.',
 u'Pellentesque',
 u'molestie',
 u'velit',
 u'nec',
 u'arcu',
 u'consectetur',
 u',',
 u'sed',
 u'tempor',
 u'tellus',
 u'varius',
 u'.',
 u'In',
 u'hac',
 u'habitasse',
 u'platea',
 u'dictumst',
 u'.',
 u'Aliquam',
 u'erat',
 u'volutpat',
 u'.',
 u'Sed',
 u'placerat',
 u'varius',
 u'purus',
 u'eget',
 u'semper',
 u'.',
 u'Fusce',
 u'quis',
 u'risus',
 u'eu',
 u'purus',
 u'faucibus',
 u'dignissim',
 u'.',
 u'Morbi',
 u'facilisis',
 u'dolor',
 u'sed',
 u'purus',
 u'tincidunt',
 u',',
 u'at',
 u'eleifend',
 u'lorem',
 u'iaculis',
 u'.',
 u'Maecenas',
 u'sed',
 u'lobortis',
 u'nunc',
 u'.',
 u'Nullam',
 u'sollicitudin',
 u'finibus',
 u'imperdiet',
 u'.',
 u'Nam',
 u'pharetra',
 u'ex',
 u'justo',
 u',',
 u'non',
 u'elementum',
 u'massa',
 u'molestie',
 u'a.',
 u'Vivamus',
 u'quis',
 u'ipsum',
 u'nulla',
 u'.',
 u'Quisque',
 u'leo',
 u'nisl',
 u',',
 u'luctus',
 u'in',
 u'ultrices',
 u'id',
 u',',
 u'malesuada',
 u'id',
 u'purus',
 u'.',
 u'Ut',
 u'laoreet',
 u'justo',
 u'eget',
 u'elementum',
 u'tempus',
 u'.',
 u'Quisque',
 u'vitae',
 u'aliquet',
 u'quam',
 u',',
 u'non',
 u'accumsan',
 u'felis',
 u'.',
 u'Sed',
 u'nec',
 u'lectus',
 u'ante',
 u'.',
 u'Phasellus',
 u'mi',
 u'diam',
 u',',
 u'facilisis',
 u'eu',
 u'vulputate',
 u'at',
 u',',
 u'vulputate',
 u'blandit',
 u'fusce',
 u'urna',
 u'.',
 u'Fusce',
 u'iaculis',
 u'pharetra',
 u'accumsan',
 u'.',
 u'In',
 u'hac',
 u'habitasse',
 u'platea',
 u'dictumst',
 u'.',
 u'Mauris',
 u'vel',
 u'massa',
 u'felis',
 u'.',
 u'Fusce',
 u'congue',
 u'et',
 u'tellus',
 u'vitae',
 u'condimentum',
 u'.',
 u'Cras',
 u'lacinia',
 u'dignissim',
 u'mollis',
 u'.',
 u'Donec',
 u'pellentesque',
 u'eleifend',
 u'purus',
 u',',
 u'non',
 u'varius',
 u'tortor',
 u'rhoncus',
 u'non',
 u'.',
 u'Nulla',
 u'tempor',
 u',',
 u'nisl',
 u'a',
 u'fermentum',
 u'tristique',
 u',',
 u'orci',
 u'dui',
 u'ultricies',
 u'velit',
 u',',
 u'at',
 u'pretium',
 u'nibh',
 u'justo',
 u'a',
 u'sem',
 u'.',
 u'In',
 u'faucibus',
 u'enim',
 u'lacinia',
 u'sem',
 u'pretium',
 u'pretium',
 u'eu',
 u'ut',
 u'lectus',
 u'.',
 u'Nullam',
 u'consequat',
 u'orci',
 u'ac',
 u'odio',
 u'ultrices',
 u',',
 u'eu',
 u'hendrerit',
 u'nisl',
 u'eleifend',
 u'.',
 u'Nulla',
 u'elit',
 u'nulla',
 u',',
 u'finibus',
 u'at',
 u'sem',
 u'nec',
 u',',
 u'faucibus',
 u'ultricies',
 u'leo',
 u'.',
 u'Nunc',
 u'ante',
 u'elit',
 u',',
 u'laoreet',
 u'non',
 u'magna',
 u'at',
 u',',
 u'laoreet',
 u'ultricies',
 u'ipsum',
 u'.',
 u'Integer',
 u'vel',
 u'nulla',
 u'hendrerit',
 u',',
 u'lacinia',
 u'sem',
 u'a',
 u',',
 u'fusce',
 u'pellentesque',
 u'lorem',
 u'.',
 u'Aliquam',
 u'ullamcorper',
 u'dapibus',
 u'fermentum',
 u'.',
 u'Integer',
 u'a',
 u'mi',
 u'sed',
 u'ante',
 u'faucibus',
 u'efficitur',
 u'non',
 u'in',
 u'metus',
 u'.',
 u'Nunc',
 u'condimentum',
 u'felis',
 u'at',
 u'consequat',
 u'vulputate',
 u'.',
 u'Donec',
 u'fusce',
 u'bibendum',
 u'rutrum',
 u'diam',
 u'.',
 u'Nunc',
 u'et',
 u'nunc',
 u'et',
 u'magna',
 u'sodales',
 u'ultrices',
 u'at',
 u'feugiat',
 u'ipsum',
 u'.',
 u'Sed',
 u'tincidunt',
 u'urna',
 u'vel',
 u'lectus',
 u'dignissim',
 u'dictum',
 u'.']

In most of the workflows that we will use in this course, it is typical to represent a document as a list of tokens like the one above.

NLTK provides a representation called Text that offers some helpful tools for exploring your document. It is basically just a wrapper around the list of tokens.


In [9]:
text = nltk.Text(tokens)

We can display all of the occurrences of a token in the document, along with their surrounding context, using the concordance() method.


In [10]:
text.concordance('velit')


Displaying 3 of 3 matches:
                                     velit , quis scelerisque orci . Fusce ac n
msan feugiat . Pellentesque molestie velit nec arcu consectetur , sed tempor te
entum tristique , orci dui ultricies velit , at pretium nibh justo a sem . In f

Similarly, the dispersion_plot() method shows the relative positions at which specific tokens occur in the document.


In [11]:
text.dispersion_plot(['ipsum', 'orci', 'sem', 'justo', 'arcu', 'Fusce', 'fusce'])


We can plot the frequency of the 20 most common tokens in the text using the plot() method.


In [12]:
text.plot(20)


Normalization

Notice in the dispersion plot example, above, that Fusce and fusce are treated as distinct tokens. The computer does not know that Fusce and fusce are the same -- one starts with an F, and the other starts with an f, which are two entirely different characters.

Normalization is the process of transforming tokens into a standardized representation. The simplest form of normalization is to convert all tokens to lower case, which we can do prior to tokenization using the lower() method.


In [13]:
print raw_text.lower()[:500]    # Only show the first 500 characters (0 - 499) of the string.


lorem ipsum dolor sit amet, consectetur adipiscing elit. pellentesque non lacinia velit, quis scelerisque orci. fusce ac nisi sit amet justo porta aliquam. fusce ac arcu ut diam imperdiet hendrerit. integer vitae finibus lacus. mauris pellentesque, ipsum eget scelerisque molestie, diam est molestie arcu, a blandit nulla eros ut dui. pellentesque odio massa, rutrum nec magna ac, viverra pulvinar magna. duis ultricies, ligula id vulputate eleifend, metus leo semper ante, sed suscipit lacus odio no

In [14]:
tokens = word_tokenize(raw_text.lower())
text = nltk.Text(tokens)

In [15]:
print text[:10]    # Only show the first 10 tokens (0-9) in the document.


[u'lorem', u'ipsum', u'dolor', u'sit', u'amet', u',', u'consectetur', u'adipiscing', u'elit', u'.']

Now notice that the token Fusce (with the upper-case F) does not appear in the document; all occurrences of Fusce have been normalized to fusce (with a lower-case f).


In [16]:
text.dispersion_plot(['Fusce', 'fusce'])


Stemming

Stemming is another useful normalization procedure. Sometimes (depending on our research question), distinguishing word forms like "evolutionary" and "evolution" is not desirable. Stemming basically involves chopping off affixes so that only the "root" or "stem" of the word remains.

The NLTK package provides several different stemmers.


In [18]:
raw_text = nltk.load('../data/abstract.txt')
tokens = word_tokenize(raw_text.lower())
print tokens[:50]    # Only the first 50 tokens.


[u'this', u'paper', u'describes', u'the', u'preliminary', u'results', u'of', u'three', u'experiments', u'in', u'subjective', u'probability', u'forecasting', u'which', u'were', u'recently', u'conducted', u'in', u'four', u'weather', u'service', u'forecast', u'offices', u'(', u'wsfos', u')', u'of', u'the', u'national', u'weather', u'service', u'.', u'the', u'first', u'experiment', u',', u'which', u'was', u'conducted', u'at', u'the', u'st.', u'louis', u'wsfo', u',', u'was', u'designed', u'to', u'investigate', u'both']

Here's an example of normalization with the Porter stemmer, which has been around since the late 1970s:


In [19]:
porter = nltk.PorterStemmer()
print [porter.stem(token) for token in tokens][:50]


[u'thi', u'paper', u'describ', u'the', u'preliminari', u'result', u'of', u'three', u'experi', u'in', u'subject', u'probabl', u'forecast', u'which', u'were', u'recent', u'conduct', u'in', u'four', u'weather', u'servic', u'forecast', u'offic', u'(', u'wsfo', u')', u'of', u'the', u'nation', u'weather', u'servic', u'.', u'the', u'first', u'experi', u',', u'which', u'wa', u'conduct', u'at', u'the', u'st.', u'loui', u'wsfo', u',', u'wa', u'design', u'to', u'investig', u'both']

...and the Lancaster stemmer:


In [20]:
lancaster = nltk.LancasterStemmer()
print [lancaster.stem(token) for token in tokens][:50]


[u'thi', u'pap', u'describ', u'the', u'prelimin', u'result', u'of', u'three', u'expery', u'in', u'subject', u'prob', u'forecast', u'which', u'wer', u'rec', u'conduc', u'in', u'four', u'weath', u'serv', u'forecast', u'off', u'(', u'wsfos', u')', u'of', u'the', u'nat', u'weath', u'serv', u'.', u'the', u'first', u'expery', u',', u'which', u'was', u'conduc', u'at', u'the', u'st.', u'lou', u'wsfo', u',', u'was', u'design', u'to', u'investig', u'both']

...and the Snowball stemmer (which is actually a collection of stemmers for many different languages):


In [21]:
snowball = nltk.SnowballStemmer("english")
print [snowball.stem(token) for token in tokens][:50]


[u'this', u'paper', u'describ', u'the', u'preliminari', u'result', u'of', u'three', u'experi', u'in', u'subject', u'probabl', u'forecast', u'which', u'were', u'recent', u'conduct', u'in', u'four', u'weather', u'servic', u'forecast', u'offic', u'(', u'wsfos', u')', u'of', u'the', u'nation', u'weather', u'servic', u'.', u'the', u'first', u'experi', u',', u'which', u'was', u'conduct', u'at', u'the', u'st.', u'loui', u'wsfo', u',', u'was', u'design', u'to', u'investig', u'both']

Lemmatization

An alternate approach to stemming is to lemmatize tokens. Instead of chopping up any old word that it encounters, the WordNet lemmatizer tries to match tokens with known words in the WordNet lexical database, and then convert them to a known "lemma" ("lexicon headword"). If the lemmatizer encounters a token that it does not recognize, it will just leave the token as-is.


In [22]:
wordnet = nltk.WordNetLemmatizer()
print [wordnet.lemmatize(token) for token in tokens][:50]


[u'this', u'paper', u'describes', u'the', u'preliminary', u'result', u'of', u'three', u'experiment', u'in', u'subjective', u'probability', u'forecasting', u'which', u'were', u'recently', u'conducted', u'in', u'four', u'weather', u'service', u'forecast', u'office', u'(', u'wsfos', u')', u'of', u'the', u'national', u'weather', u'service', u'.', u'the', u'first', u'experiment', u',', u'which', u'wa', u'conducted', u'at', u'the', u'st.', u'louis', u'wsfo', u',', u'wa', u'designed', u'to', u'investigate', u'both']
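
Notice the odd u'wa' in the output above: lemmatize() assumes that a token is a noun unless you tell it otherwise, so "was" is treated as a plural noun and stripped of its "s". Passing the part of speech via the pos argument gives better results:

wordnet.lemmatize('was')                 # u'wa' -- treated as a noun by default
wordnet.lemmatize('was', pos='v')        # u'be'
wordnet.lemmatize('describes', pos='v')  # u'describe'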

Side-by-side comparisons

Here's a side-by-side comparison of the stemmers shown above.


In [38]:
stemmers = [porter.stem, lancaster.stem, snowball.stem, wordnet.lemmatize]
stemmer_labels = ['original', 'porter', 'lancaster', 'snowball', 'wordnet']
print '\t'.join([t.ljust(8) for t in stemmer_labels])
print '--'*40
for token in tokens[:30]:
    print token.ljust(8),
    for stemmer in stemmers:
        print '\t', stemmer(token).ljust(8),
    print


original	porter  	lancaster	snowball	wordnet 
--------------------------------------------------------------------------------
this     	thi      	thi      	this     	this    
paper    	paper    	pap      	paper    	paper   
describes 	describ  	describ  	describ  	describes
the      	the      	the      	the      	the     
preliminary 	preliminari 	prelimin 	preliminari 	preliminary
results  	result   	result   	result   	result  
of       	of       	of       	of       	of      
three    	three    	three    	three    	three   
experiments 	experi   	expery   	experi   	experiment
in       	in       	in       	in       	in      
subjective 	subject  	subject  	subject  	subjective
probability 	probabl  	prob     	probabl  	probability
forecasting 	forecast 	forecast 	forecast 	forecasting
which    	which    	which    	which    	which   
were     	were     	wer      	were     	were    
recently 	recent   	rec      	recent   	recently
conducted 	conduct  	conduc   	conduct  	conducted
in       	in       	in       	in       	in      
four     	four     	four     	four     	four    
weather  	weather  	weath    	weather  	weather 
service  	servic   	serv     	servic   	service 
forecast 	forecast 	forecast 	forecast 	forecast
offices  	offic    	off      	offic    	office  
(        	(        	(        	(        	(       
wsfos    	wsfo     	wsfos    	wsfos    	wsfos   
)        	)        	)        	)        	)       
of       	of       	of       	of       	of      
the      	the      	the      	the      	the     
national 	nation   	nat      	nation   	national
weather  	weather  	weath    	weather  	weather 

In [39]:
words = ['sustain', 'sustenance', 'sustaining', 'sustains', 
         'sustained', 'sustainable', 'sustainability']
stemmers = [porter.stem, lancaster.stem, snowball.stem, wordnet.lemmatize]
stemmer_labels = ['original', 'porter', 'lancaster', 'snowball', 'wordnet']
print '\t'.join([t.ljust(8) for t in stemmer_labels])
print '--'*40
for token in words:
    print token.ljust(8),
    for stemmer in stemmers:
        print '\t', stemmer(token).ljust(8),
    print


original	porter  	lancaster	snowball	wordnet 
--------------------------------------------------------------------------------
sustain  	sustain  	sustain  	sustain  	sustain 
sustenance 	susten   	sust     	susten   	sustenance
sustaining 	sustain  	sustain  	sustain  	sustaining
sustains 	sustain  	sustain  	sustain  	sustains
sustained 	sustain  	sustain  	sustain  	sustained
sustainable 	sustain  	sustain  	sustain  	sustainable
sustainability 	sustain  	sustain  	sustain  	sustainability

In [48]:
wordnet.lemmatize('knows'), wordnet.lemmatize('sleeps'), wordnet.lemmatize('types')


Out[48]:
(u'know', u'sleep', u'type')

Filtering

Finally, depending on our analysis, we may want to filter out various tokens. For example, we may wish to exclude common words like pronouns. The easiest way to remove these kinds of tokens is to use a stoplist: simply a list of undesirable words.

NLTK provides stoplists containing around 2,400 words for 11 different languages.


In [24]:
from nltk.corpus import stopwords
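
You can see which languages are available using the corpus reader's fileids() method (the exact list depends on the version of the NLTK data that you have installed):

stopwords.fileids()    # e.g. [u'danish', u'dutch', u'english', u'finnish', ...]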

Here are the English stopwords:


In [25]:
stoplist = stopwords.words('english')
print stoplist


[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', u'd', u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u'couldn', u'didn', u'doesn', u'hadn', u'hasn', u'haven', u'isn', u'ma', u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn']

We can filter out tokens in the stoplist like so:


In [26]:
print [token for token in tokens if token not in stoplist][:20]


[u'paper', u'describes', u'preliminary', u'results', u'three', u'experiments', u'subjective', u'probability', u'forecasting', u'recently', u'conducted', u'four', u'weather', u'service', u'forecast', u'offices', u'(', u'wsfos', u')', u'national']

We may also want to remove any punctuation tokens (note the parentheses, above). We can do that using the isalpha() method.


In [27]:
print [token for token in tokens if token.isalpha()][:20]


[u'this', u'paper', u'describes', u'the', u'preliminary', u'results', u'of', u'three', u'experiments', u'in', u'subjective', u'probability', u'forecasting', u'which', u'were', u'recently', u'conducted', u'in', u'four', u'weather']

Putting it all together

Once you have decided on a strategy for tokenizing, normalizing, and filtering your texts, it's a good idea to encapsulate that logic in functions that you can apply as needed.


In [28]:
def normalize_token(token):
    """
    Convert token to lowercase, and stem using the Porter algorithm.
    
    Parameters
    ----------
    token : str
    
    Returns
    -------
    token : str
    """
    return porter.stem(token.lower())

In [29]:
def filter_token(token):
    """
    Evaluate whether or not to retain ``token``.
    
    Parameters
    ----------
    token : str
    
    Returns
    -------
    keep : bool
    """
    token = token.lower()
    return token not in stoplist and token.isalpha() and len(token) > 3
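
A quick sanity check of both helpers (expected values shown as comments):

filter_token('the')              # False -- 'the' is in the stoplist
filter_token('(')                # False -- not alphabetic
filter_token('weather')          # True
normalize_token('Forecasting')   # 'forecast'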

Let's compare the effect on our token distributions. Here are the top 20 tokens when we only tokenize:


In [30]:
unprocessed_text = nltk.Text(word_tokenize(raw_text))
unprocessed_text.plot(20)


And here are the top 20 tokens when we apply our full processing pipeline:


In [31]:
processed_text = nltk.Text([normalize_token(token) for token
                            in word_tokenize(raw_text)
                            if filter_token(token)])
processed_text.plot(20)
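
If you find yourself repeating this pattern, you can wrap the whole pipeline in a single convenience function (a small sketch; the name process_text is our own):

def process_text(raw):
    """Tokenize, normalize, and filter a raw text string."""
    return [normalize_token(token)
            for token in word_tokenize(raw)
            if filter_token(token)]

processed_text = nltk.Text(process_text(raw_text))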